Chi-Square Classifier for Document Categorization
Abstract
We consider the problem of document categorization. The set of domains and the keywords specific to each domain are assumed to be selected beforehand as initial data. We apply the well-known chi-square statistical hypothesis test, treating the images of documents and domains as normalized vectors. In comparison with existing methods, this approach takes into account the random character of the initial data. The classifier is developed within the Document Investigator software package.
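The abstract does not give the exact statistic or vectorization, so the following is only a minimal sketch of how a chi-square test could assign a document to a domain. It assumes a Pearson goodness-of-fit statistic, a shared keyword list, and domain profiles given as normalized vectors with strictly positive entries. The function names, keywords, and profile values are hypothetical and do not come from the paper or from Document Investigator.

```python
from collections import Counter


def keyword_counts(text, keywords):
    """Count each keyword in lowercased, whitespace-tokenized text."""
    tokens = Counter(text.lower().split())
    return [tokens[k] for k in keywords]


def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic of observed counts against a profile.

    expected_probs is a normalized vector (sums to 1) whose entries are
    assumed to be strictly positive, e.g. after smoothing.
    """
    n = sum(observed)
    if n == 0:
        # No keywords found: the document carries no evidence for any domain.
        return float("inf")
    stat = 0.0
    for o, p in zip(observed, expected_probs):
        expected = n * p
        stat += (o - expected) ** 2 / expected
    return stat


def classify(text, keywords, domain_profiles):
    """Return the domain with the smallest statistic, plus all scores."""
    observed = keyword_counts(text, keywords)
    scores = {
        domain: chi_square_statistic(observed, probs)
        for domain, probs in domain_profiles.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical keywords and domain profiles, for illustration only.
keywords = ["gene", "protein", "market", "stock"]
domain_profiles = {
    "biology": [0.45, 0.45, 0.05, 0.05],
    "finance": [0.05, 0.05, 0.45, 0.45],
}
label, scores = classify("gene protein gene market", keywords, domain_profiles)
print(label, scores)
```

In a full hypothesis test, the smallest statistic would also be compared against a chi-square critical value with (number of keywords minus 1) degrees of freedom, so that a document matching no domain well can be rejected rather than forced into the nearest one.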
Similar resources
A Comparative Study on Different Types of Approaches to Bengali document Categorization
Document categorization is a technique in which the category of a document is determined. In this paper, three well-known supervised learning techniques, Support Vector Machine (SVM), Naïve Bayes (NB), and Stochastic Gradient Descent (SGD), are compared for Bengali document categorization. Besides the classifier, classification also depends on how features are selected from the dataset. F...
Chi Square Feature Extraction Based SVMs Arabic Language Text Categorization System
This paper aims to implement a Support Vector Machines (SVMs) based text classification system for Arabic language articles. The classifier uses the chi-square method for feature selection in the pre-processing step of the text classification system design procedure. Compared to other classification methods, our system shows high classification effectiveness for the Arabic data set in term ...
Chi-square-based scoring function for categorization of MEDLINE citations
OBJECTIVES: Text categorization has been used in biomedical informatics for identifying documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood that a MEDLINE citation contains a genetics-relevant topic. METHODS: Our procedure requires construction of a genetic and a nongenetic domain document corpus. W...
Naïve Bayesian Based on Chi Square to Categorize Arabic Data
Text classification is a supervised technique that uses labelled training data to learn the classification system and then automatically classifies the remaining text using the learned system. This paper investigates the Naïve Bayesian algorithm based on the chi-square feature selection method. Our comparisons are based on the macro F1, macro recall, and macro precision evaluation measures. The experime...
Statistical Feature Selection Techniques for Arabic Text Categorization
This paper compares a few statistical feature selection techniques for Arabic text. Feature selection is especially important for text classification because, when dealing with text, the number of features/words increases rapidly. This makes the document-term matrix a sparse one which affects the performance of classifiers in terms of accuracy and in terms of processing time. One opts to reduce...
Publication date: 2001